CMPINF 2120 Final Project

Ryan Negron

EDA Notebook

Data Clean-up and Manipulation

This notebook begins by cleaning and reshaping both the game data and the team-statistics data frames. This takes some extra work because my initial proposal did not yet cover team statistics. The EDA below therefore includes new inputs drawn from team stats across the seasons.

game_id is kept as a sort of index and reference point for all of the data. Indoor games were missing both temp and wind, so those values had to be filled in. After significant research I settled on a wind value of 1 for dome stadiums and 3 for closed retractable-roof stadiums. Temperature is held constant for indoor games: 70.0 (room temperature) for domes and the mean recorded temperature for closed retractable roofs.
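The fill-in rule described above can be sketched on a toy frame. This is a minimal illustration, not the cells used in this notebook: the helper name `impute_indoor_weather`, its parameters, and the toy rows are all hypothetical, while the stand-in values (wind 1 and 3, temp 70.0 and the mean recorded temperature) follow the cleaning cells further down.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the same roof/temp/wind columns as the game data.
toy = pd.DataFrame({
    'roof': ['outdoors', 'dome', 'closed', 'open'],
    'temp': [40.0, np.nan, np.nan, 80.0],
    'wind': [12.0, np.nan, np.nan, 4.0],
})

def impute_indoor_weather(df, dome_wind=1, closed_wind=3, dome_temp=70.0):
    """Hypothetical helper: fill indoor wind/temp with fixed stand-in values."""
    out = df.copy()
    # Mean of the temperatures that were actually recorded (NaN is skipped).
    observed_mean = out['temp'].mean()
    is_dome = out['roof'] == 'dome'
    is_closed = out['roof'] == 'closed'
    out.loc[is_dome, 'wind'] = dome_wind
    out.loc[is_dome, 'temp'] = dome_temp
    out.loc[is_closed, 'wind'] = closed_wind
    out.loc[is_closed, 'temp'] = observed_mean
    return out

imputed = impute_indoor_weather(toy)
print(imputed)
```

Working on a copy keeps the raw frame intact, so the imputed values can be compared against the original missing entries.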

The game data and the team data overlap for every season except 2023, so this analysis is current only through the end of the 2022 season.
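One way to confirm which seasons two frames share is a set comparison on the season column. The sketch below uses made-up stand-in frames (`games` and `stats` are hypothetical names, not the frames loaded in this notebook) and only illustrates the check.

```python
import pandas as pd

# Toy stand-ins (made-up rows) for the game and team-statistics frames.
games = pd.DataFrame({'season': [2021, 2022, 2023],
                      'home_team': ['PIT', 'PIT', 'PIT']})
stats = pd.DataFrame({'season': [2021, 2022],
                      'team': ['PIT', 'PIT']})

game_seasons = set(games['season'])
stat_seasons = set(stats['season'])

# Seasons present in both frames, and seasons only the game data covers.
shared = sorted(game_seasons & stat_seasons)
games_only = sorted(game_seasons - stat_seasons)

print('shared seasons:', shared)
print('seasons missing from team stats:', games_only)
```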

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from patsy import dmatrices
In [3]:
team_stats = pd.read_csv('nfl-team-statistics.csv')
In [4]:
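team_stats_copy = team_stats.copy()
# Duplicate the team column under the names used in the game data so the frames can be merged later.
team_stats_copy['home_team'] = team_stats['team']
team_stats_copy['away_team'] = team_stats['team']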
In [5]:
stats_list = ['home_team', 'season', 'offense_completion_percentage', 'offense_ave_yards_gained_pass', 'offense_ave_yards_gained_run', 'defense_ave_yards_gained_pass', 'defense_ave_yards_gained_run', 'points_allowed']
In [6]:
trim_stats = team_stats_copy[stats_list]
In [7]:
nfl_raw ='https://raw.githubusercontent.com/nflverse/nfldata/master/data/games.csv'
nfl_game_data = pd.read_csv(nfl_raw)
In [8]:
nfl_copy = nfl_game_data.copy()
In [9]:
columns_list = ['game_id', 'season', 'total', 'overtime', 'away_team', 'home_team', 'total_line', 'roof', 'surface', 'temp', 'wind']
trim_nfl = nfl_copy[columns_list]
In [10]:
nfl_df = trim_nfl.copy()
# 1 if the game total met or exceeded the betting line (a push counts as reached), else 0.
nfl_df['point_total_reached'] = np.where(trim_nfl.total - trim_nfl.total_line >= 0, 1, 0)
In [11]:
nfl_df_copy = nfl_df.copy()
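# Assign the result back instead of using inplace on an attribute, which newer pandas may not apply to the frame.
nfl_df_copy['surface'] = nfl_df_copy['surface'].fillna('UNKNOWN')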
In [12]:
nfl_df_copy['opponent'] = nfl_df_copy.loc[:, 'away_team']
In [13]:
ready_nfl_df = nfl_df_copy.drop(columns = 'away_team')
In [14]:
# Map relocated franchises to their current abbreviations before merging.
relocated = {'OAK': 'LV', 'SD': 'LAC', 'STL': 'LA'}
ready_nfl_df[['home_team', 'opponent']] = ready_nfl_df[['home_team', 'opponent']].replace(relocated)
# Recode the overtime flag as a categorical label for plotting.
ready_nfl_df['overtime'] = np.where(ready_nfl_df['overtime'] == 1, 'YES', 'NO')
In [15]:
nfl_merged_df = pd.merge(ready_nfl_df, trim_stats, how = 'outer', on = ['season', 'home_team'])
In [16]:
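# Closed retractable roofs: set wind to a stand-in value of 3.
nfl_merged_df['wind'] = np.where(nfl_merged_df['roof'] == 'closed', 3, nfl_merged_df['wind'])
In [17]:
# Domes: set wind to a stand-in value of 1.
nfl_merged_df['wind'] = np.where(nfl_merged_df['roof'] == 'dome', 1, nfl_merged_df['wind'])
In [18]:
# Closed retractable roofs: set temp to the mean of the recorded temperatures.
nfl_merged_df['temp'] = np.where(nfl_merged_df['roof'] == 'closed', nfl_merged_df.temp.mean(), nfl_merged_df['temp'])
In [19]:
# Domes: set temp to room temperature (70.0).
nfl_merged_df['temp'] = np.where(nfl_merged_df['roof'] == 'dome', 70.0, nfl_merged_df['temp'])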
In [20]:
nfl_merged_df.dropna(inplace = True)
In [23]:
nfl_merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6215 entries, 0 to 6420
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   game_id                        6215 non-null   object 
 1   season                         6215 non-null   int64  
 2   total                          6215 non-null   int64  
 3   overtime                       6215 non-null   object 
 4   home_team                      6215 non-null   object 
 5   total_line                     6215 non-null   float64
 6   roof                           6215 non-null   object 
 7   surface                        6215 non-null   object 
 8   temp                           6215 non-null   float64
 9   wind                           6215 non-null   float64
 10  point_total_reached            6215 non-null   int64  
 11  opponent                       6215 non-null   object 
 12  offense_completion_percentage  6215 non-null   float64
 13  offense_ave_yards_gained_pass  6215 non-null   float64
 14  offense_ave_yards_gained_run   6215 non-null   float64
 15  defense_ave_yards_gained_pass  6215 non-null   float64
 16  defense_ave_yards_gained_run   6215 non-null   float64
 17  points_allowed                 6215 non-null   float64
dtypes: float64(9), int64(3), object(6)
memory usage: 922.5+ KB
In [33]:
sns.relplot(data = nfl_merged_df, x = 'points_allowed', y = 'offense_ave_yards_gained_pass', hue = 'point_total_reached', kind = 'scatter')
plt.show()
[Figure: scatter plot of points_allowed (x) vs. offense_ave_yards_gained_pass (y), colored by point_total_reached]
In [25]:
sns.relplot(data = nfl_merged_df, x = 'points_allowed', y = 'offense_ave_yards_gained_pass', hue = 'home_team', col = 'point_total_reached', kind = 'scatter')

plt.show()
[Figure: scatter plots of points_allowed (x) vs. offense_ave_yards_gained_pass (y), colored by home_team, one panel per point_total_reached value]
In [27]:
sns.relplot(data = nfl_merged_df, x = 'points_allowed', y = 'offense_ave_yards_gained_run', hue = 'home_team', col = 'point_total_reached', kind = 'scatter')
plt.show()
[Figure: scatter plots of points_allowed (x) vs. offense_ave_yards_gained_run (y), colored by home_team, one panel per point_total_reached value]
In [29]:
sns.relplot(data = nfl_merged_df, x = 'temp', y = 'wind', hue = 'home_team', col = 'point_total_reached', kind = 'scatter')

plt.show()
[Figure: scatter plots of temp (x) vs. wind (y), colored by home_team, one panel per point_total_reached value]
In [60]:
sns.relplot(data = nfl_merged_df, x = 'defense_ave_yards_gained_pass', y = 'offense_ave_yards_gained_run', hue = 'point_total_reached', col = 'home_team', col_wrap = 4, kind = 'scatter')
plt.show()
[Figure: scatter plots of defense_ave_yards_gained_pass (x) vs. offense_ave_yards_gained_run (y), colored by point_total_reached, one panel per home_team]
In [43]:
sns.catplot(data = nfl_merged_df, y= 'home_team', hue = 'point_total_reached', kind = 'count')

plt.show()
[Figure: count plot of games per home_team, split by point_total_reached]
In [57]:
sns.catplot(data = nfl_merged_df, x = 'overtime', hue = 'point_total_reached', kind = 'count')

plt.show()
[Figure: count plot of games by overtime, split by point_total_reached]
In [45]:
sns.pairplot(data = nfl_merged_df, hue = 'point_total_reached')

plt.show()
[Figure: pair plot of the numeric columns in nfl_merged_df, colored by point_total_reached]

EDA Summary

During EDA I saw no real correlation with the output apart from the two variables used to construct it (total and total_line). One plausible reason is that Vegas models on a completely different level: the lines may already factor in all of these inputs and many more, which would explain the near-even split between games that reached their point total and games that did not across 20+ seasons. The only factor that showed any skew was whether a game went into overtime, which adds game time for potential scoring. It was also apparent that the inputs sit on different scales and need to be standardized. Their distributions looked approximately Gaussian, so no further transformations seemed necessary.
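The two observations above (the overtime skew and the need to standardize) can each be checked numerically. The sketch below runs on made-up rows, not the notebook's data, and the z-score uses the population standard deviation (ddof=0), which is the convention StandardScaler also follows.

```python
import pandas as pd

# Made-up rows for illustration; column names mirror nfl_merged_df.
toy = pd.DataFrame({
    'overtime': ['NO', 'NO', 'NO', 'NO', 'YES', 'YES'],
    'point_total_reached': [1, 0, 1, 0, 1, 1],
    'temp': [40.0, 55.0, 70.0, 85.0, 60.0, 50.0],
})

# Share of games that reached the point total, overall and by overtime status.
overall_rate = toy['point_total_reached'].mean()
rate_by_overtime = toy.groupby('overtime')['point_total_reached'].mean()

# Z-score standardization: subtract the mean, divide by the population std.
temp_z = (toy['temp'] - toy['temp'].mean()) / toy['temp'].std(ddof=0)

print(overall_rate)
print(rate_by_overtime)
print(temp_z.round(3))
```

A large gap between the two overtime groups relative to the overall rate is the kind of skew the count plot shows visually.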

With that in mind, when I took these results into modeling I considered altering the classification threshold and assessing performance through a different lens. If a model can correctly classify event versus non-event even slightly more than 50 percent of the time, it can be viewed as accurate relative to the lines set. At this point I know that far more complicated factors would have to be included to gain any real edge, but this was an interesting starting point.
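As a rough sketch of that comparison, accuracy at a chosen threshold can be set against a majority-class baseline. The labels and predicted probabilities below are invented for illustration, and `accuracy_at` is a hypothetical helper rather than anything fitted in this project.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.62, 0.41, 0.55, 0.48, 0.30, 0.58, 0.71, 0.45])

def accuracy_at(threshold):
    """Hypothetical helper: accuracy when predicting 1 at or above the threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    return float((y_pred == y_true).mean())

# Accuracy from always predicting the more common class.
event_rate = float(y_true.mean())
majority_baseline = max(event_rate, 1 - event_rate)

for threshold in (0.45, 0.50, 0.55):
    print(threshold, accuracy_at(threshold), 'baseline:', majority_baseline)
```

With a near-even class split the baseline sits close to 0.5, so any consistent margin above it is what would count as useful here.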
